Simple Methods of Finding Short Protein Coding Sequences
Abstract
Eukaryotic genomes contain many conserved regions of unknown function. Accurately assessing the protein-coding potential of these regions is a key step in annotation. We develop three protein-coding measures that directly assess conserved regions in multiple sequence alignments of many species: one based on phase shifts induced by alignment gaps, another based on third-position mutation asymmetry in codons, and a third based on nucleotide composition asymmetry. The methods are easy to implement and require no training. Using a human-chimp-rat-mouse-chicken multiple alignment, these measures can classify coding regions as short as 30 nt with greater specificity than single-genome measures using 120 nt. Results from human-mouse and human-chicken alignments can be further improved by considering additional species; only the chimp genome proved uninformative. The phase-shift method is especially accurate.
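The abstract gives no formulas, so the following is only a minimal sketch of the intuition behind the first two measures, not the authors' implementation. It assumes that coding regions tend to keep gap lengths in multiples of three (so the reading frame is preserved) and that substitutions in coding regions concentrate at the third codon position. The function names, the scoring, and the toy sequences are all illustrative assumptions.

```python
from typing import List

GAP = "-"


def gap_run_lengths(aligned: str) -> List[int]:
    """Return the lengths of maximal runs of gap characters in one aligned row."""
    runs, current = [], 0
    for ch in aligned:
        if ch == GAP:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def frame_preserving_fraction(rows: List[str]) -> float:
    """Fraction of gap runs (over all rows) whose length is a multiple of 3.

    Hypothetical score: values near 1.0 mean gaps rarely shift the frame.
    Returns 1.0 when the alignment contains no gaps at all.
    """
    runs = [n for row in rows for n in gap_run_lengths(row)]
    if not runs:
        return 1.0
    return sum(1 for n in runs if n % 3 == 0) / len(runs)


def third_position_share(ref: str, other: str, frame: int = 0) -> float:
    """Share of mismatches between two aligned rows that fall on codon position 3.

    Codon position is counted along the ungapped reference row, starting at
    offset `frame`. Columns with a gap in either row are not counted as
    mismatches. Returns 0.0 when there are no mismatches.
    """
    counts = [0, 0, 0]
    pos = 0  # index along the ungapped reference
    for a, b in zip(ref, other):
        if a == GAP:
            continue
        if b != GAP and a.upper() != b.upper():
            counts[(pos - frame) % 3] += 1
        pos += 1
    total = sum(counts)
    return counts[2] / total if total else 0.0


if __name__ == "__main__":
    # Toy alignment rows, invented purely for illustration.
    print(frame_preserving_fraction(["ATGGCC---AAA", "ATG-CCTTTAAA"]))
    print(third_position_share("ATGGCCAAA", "ATGGCTAAG"))
```

In practice such scores would be computed in all three frames and compared across species pairs; how the paper combines them is not stated in the abstract.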
Similar resources
Long non-coding RNAs and their significance in human diseases
Protein-coding genes account for only a small fraction of the human genome and most of the genomic sequences are transcriptionally silent, but recent observations indicate significant functional elements, including non-coding protein transcripts in the human genome. Long non-coding RNAs (lncRNAs) have been defined as transcripts of >200 nucleotides without protein-coding capacity that perform t...
Phylogenetic Analysis of Three Long Non-coding RNA Genes: AK082072, AK043754 and AK082467
It is now clear that protein is just one of the functional products of the eukaryotic genome. Indeed, more of the human genome is transcribed into non-coding sequences than into protein-coding sequences. In this study, we selected three long non-coding RNAs, namely AK082072, AK043754 and AK082467, which show brain expression and local region conservation among vertebr...
Cloning and Characterization of cbhII Gene from Trichoderma parceramosum and Its Expression in Pichia pastoris
The genomic and cDNA clones encoding cellobiohydrolase II (CBHII) have been isolated and sequenced from a native Iranian isolate of Trichoderma parceramosum that produces high levels of cellulolytic enzymes. This is the first report of the cbhII gene from this organism. Comparison of the genomic and cDNA sequences indicates that this gene contains three short introns and also an open reading frame codin...
Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are among the most important portions of biological sequences and play crucial roles in the fold and functionality of the corresponding sequence. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes depending on the structure of the context. Much research has been performed to critically determine the structure and function of repetitiv...
Detection of Protein Coding Sequences Using a Mixture Model for Local Protein Amino Acid Sequence
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurren...